{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c4181d83-8eef-4a5e-9b2b-9e459ced8e84",
   "metadata": {},
   "source": [
    "# Building a Live RAG Pipeline over Google Drive Files\n",
    "\n",
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/ingestion/ingestion_gdrive.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "In this guide we show you how to build a \"live\" RAG pipeline over Google Drive files.\n",
    "\n",
    "This pipeline will index Google Drive files and dump them to a Redis vector store. Afterwards, every time you rerun the ingestion pipeline, the pipeline will propagate **incremental updates**, so that only changed documents are updated in the vector store. This means that we don't re-index all the documents!\n",
    "\n",
    "We use the following [data source](https://drive.google.com/drive/folders/1RFhr3-KmOZCR5rtp4dlOMNl3LKe1kOA5?usp=sharing) - you will need to copy these files and upload them to your own Google Drive directory! \n",
    "\n",
    "**NOTE**: You will also need to setup a service account and credentials.json. See our LlamaHub page for the Google Drive loader for more details: https://llamahub.ai/l/readers/llama-index-readers-google?from=readers\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a7caa90-8418-4b1b-8dc4-31ac81da39f3",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "We install required packages and launch the Redis Docker image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5179ede",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-storage-docstore-redis\n",
    "%pip install llama-index-vector-stores-redis\n",
    "%pip install llama-index-embeddings-huggingface\n",
    "%pip install llama-index-readers-google"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03f480f0-71e4-4d50-8efa-deae20172764",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "d32273cc1267d3221afa780db0edcd6ce5eee08ae88886f645410b9a220d4916\n"
     ]
    }
   ],
   "source": [
    "# if creating a new container\n",
    "!docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n",
    "# # if starting an existing container\n",
    "# !docker start -a redis-stack"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b63d3831-70e9-4b7b-b876-1143fd580c6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69b0320e-47d6-48a8-9ba1-d844bb887cb5",
   "metadata": {},
   "source": [
    "## Define Ingestion Pipeline\n",
    "\n",
    "Here we define the ingestion pipeline. Given a set of documents, we will run sentence splitting/embedding transformations, and then load them into a Redis docstore/vector store.\n",
    "\n",
    "The vector store is for indexing the data + storing the embeddings, the docstore is for tracking duplicates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f2871ad-1c14-49e4-b5ec-3e3eb96429f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "from llama_index.core.ingestion import (\n",
    "    DocstoreStrategy,\n",
    "    IngestionPipeline,\n",
    "    IngestionCache,\n",
    ")\n",
    "from llama_index.core.ingestion.cache import RedisCache\n",
    "from llama_index.storage.docstore.redis import RedisDocumentStore\n",
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "from llama_index.vector_stores.redis import RedisVectorStore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca63bf0d-9455-40cb-b30f-9a23e1990c08",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = RedisVectorStore(\n",
    "    index_name=\"redis_vector_store\",\n",
    "    index_prefix=\"vectore_store\",\n",
    "    redis_url=\"redis://localhost:6379\",\n",
    ")\n",
    "\n",
    "cache = IngestionCache(\n",
    "    cache=RedisCache.from_host_and_port(\"localhost\", 6379),\n",
    "    collection=\"redis_cache\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78043d63-bd88-4367-b883-5ad6075339ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: clear vector store if exists\n",
    "if vector_store._index_exists():\n",
    "    vector_store.delete_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3be817bd-81a1-436f-8f92-3eb48531c915",
   "metadata": {},
   "outputs": [],
   "source": [
    "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\n",
    "\n",
    "pipeline = IngestionPipeline(\n",
    "    transformations=[\n",
    "        SentenceSplitter(),\n",
    "        embed_model,\n",
    "    ],\n",
    "    docstore=RedisDocumentStore.from_host_and_port(\n",
    "        \"localhost\", 6379, namespace=\"document_store\"\n",
    "    ),\n",
    "    vector_store=vector_store,\n",
    "    cache=cache,\n",
    "    docstore_strategy=DocstoreStrategy.UPSERTS,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a873168-f735-43cd-b511-0bb569f9c8b4",
   "metadata": {},
   "source": [
    "### Define our Vector Store Index\n",
    "\n",
    "We define our index to wrap the underlying vector store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5affc83-b5f0-40c9-a8a1-b4ddd67fa62b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex\n",
    "\n",
    "index = VectorStoreIndex.from_vector_store(\n",
    "    pipeline.vector_store, embed_model=embed_model\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "343f9de5-4373-458b-b6cf-a8173f3e9a52",
   "metadata": {},
   "source": [
    "## Load Initial Data\n",
    "\n",
    "Here we load data from our [Google Drive Loader](https://llamahub.ai/l/readers/llama-index-readers-google?from=readers) on LlamaHub. \n",
    "\n",
    "The loaded docs are the header sections of our [Use Cases from our documentation](https://docs.llamaindex.ai/en/latest/use_cases/q_and_a/root.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c1c0d28-f18a-4efc-b6bf-9173016df8ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.google import GoogleDriveReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "983b9e84-b546-4b5a-9072-c6b3c8ec8699",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GoogleDriveReader()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fb56b73-78f1-41c3-b93e-98d3a701f2c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_data(folder_id: str):\n",
    "    docs = loader.load_data(folder_id=folder_id)\n",
    "    for doc in docs:\n",
    "        doc.id_ = doc.metadata[\"file_name\"]\n",
    "    return docs\n",
    "\n",
    "\n",
    "docs = load_data(folder_id=\"1RFhr3-KmOZCR5rtp4dlOMNl3LKe1kOA5\")\n",
    "# print(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c77f74b2-9bbe-46d6-b35f-23ea757b315b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ingested 6 Nodes\n"
     ]
    }
   ],
   "source": [
    "nodes = pipeline.run(documents=docs)\n",
    "print(f\"Ingested {len(nodes)} Nodes\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae510add-f8d3-4fb3-a351-1cc7a5fe9e6b",
   "metadata": {},
   "source": [
    "Since this is our first time starting up the vector store, we see that we've transformed/ingested all the documents into it (by chunking, and then by embedding)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9687636-45e3-4038-b72e-b8c2d86baf56",
   "metadata": {},
   "source": [
    "### Ask Questions over Initial Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c54d927-3b56-4844-89c5-61a5aac1df6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6de7fa66-7f58-489c-abb8-d87b5d2936a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"What are the sub-types of question answering?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87a41b20-6e72-4757-a697-a6f3d288c8dd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The sub-types of question answering mentioned in the context are semantic search and summarization.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "800b5201-0b69-4419-9294-c03ee85b0755",
   "metadata": {},
   "source": [
    "## Modify and Reload the Data\n",
    "\n",
    "Let's try modifying our ingested data! \n",
    "\n",
    "We modify the \"Q&A\" doc to include an extra \"structured analytics\" block of text. See our [updated document](https://docs.google.com/document/d/1QQMKNAgyplv2IUOKNClEBymOFaASwmsZFoLmO_IeSTw/edit?usp=sharing) as a reference.\n",
    "\n",
    "Now let's rerun the ingestion pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d490fbb8-82ec-4284-a19d-1a8ca69da2a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ingested 1 Nodes\n"
     ]
    }
   ],
   "source": [
    "docs = load_data(folder_id=\"1RFhr3-KmOZCR5rtp4dlOMNl3LKe1kOA5\")\n",
    "nodes = pipeline.run(documents=docs)\n",
    "print(f\"Ingested {len(nodes)} Nodes\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "768505db-02ee-4929-a8e7-1ee127356c98",
   "metadata": {},
   "source": [
    "Notice how only one node is ingested. This is beacuse only one document changed, while the other documents stayed the same. This means that we only need to re-transform and re-embed one document!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56ba5205-09d7-46f0-8a97-ac58c3f9b649",
   "metadata": {},
   "source": [
    "### Ask Questions over New Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b52d5f4b-7818-437b-ac33-ce8257e00048",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5cf1b0b-f6f1-45eb-ac61-a154a66c57d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"What are the sub-types of question answering?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "486cce9e-567a-4ef4-8793-875880e09756",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The sub-types of question answering mentioned in the context are semantic search, summarization, and structured analytics.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
